Introduction

This script analyses the effect of choosing a subset of events from a drought index series in order to conduct an interview on drought impacts. The analysis is conducted by creating a synthetic distribution with a known correlation and studying the effect of the choice of events and of possible debiasing methods (e.g. through imputation).

The choice of a limited number of events is dictated by the need to ask the interviewee about 8 or so periods identified by a designated normalized drought indicator, finding out whether they were seen as drought periods and how severe they were compared to other drought periods. Given that more than 30 drought events are often present in the normalized series, a selection of events is needed to conduct the interview effectively. Since the correlation between the ranking of drought events obtained from the indices and the perceived severity ranking of drought events is of interest, the possible bias introduced by this procedure is studied in this script.

Synthetic dataset creation

A synthetic dataset of events is created in order to study many possible cases. The dataset is created by choosing an appropriate distribution and by imposing a threshold that models whether drought periods are detected as such. The dataset creation is based on the hypothesis that drought periods are evaluated by professionals as more or less severe according to an implicit categorization which, while perhaps based on a multitude of inputs and parameters, can be represented as a one-dimensional magnitude value, say \(\chi_{P}\). Furthermore, the classification is assumed to have a lower threshold, such that events below a certain severity are not perceived as drought periods. Finally, it is hypothesized that this implicit ranking bears some correlation with a ranking based on characteristics of drought periods as defined through a normalized index (e.g. SPI, SPEI, SSI) and analyzed through run theory. As such, a composite one-dimensional \(\chi_{I}\) value is calculated for each event of an index series.

Given that \(\chi_{P}\) is unknown, for the study of this problem it is synthetically obtained from a distribution, which is defined based on the distribution of \(\chi_{I}\) values obtained from a known index series, assuming the previously stated hypothesis. As such, we must first find a suitable distribution for the \(\chi\) magnitude values.

Choice of drought “magnitude” distribution

Run analysis of normalized index series usually involves the definition of drought periods as “runs” under a certain threshold (e.g. -1). For each of these runs, the following quantities are usually defined in the literature:

  • Drought Severity (DS): sum of the index values during the run.

  • Drought Duration (DD): length (in months) of the run.

  • Drought Intensity (DI): DS divided by DD, representing the mean index value during the run.

    Figure 1: example of drought run characteristics.
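
The run characteristics listed above can be sketched in a few lines of Python. This is a minimal illustration and not the script's own code: the function name `drought_runs`, the strict "below threshold" comparison and the sample index values are all assumptions made for the example.

```python
def drought_runs(index_values, threshold=-1.0):
    """Return one (DS, DD, DI) tuple per run of values strictly below `threshold`."""
    runs = []
    current = []  # index values of the run currently being accumulated
    for value in index_values:
        if value < threshold:
            current.append(value)
        elif current:
            runs.append(current)
            current = []
    if current:  # the series ended while still inside a run
        runs.append(current)

    events = []
    for run in runs:
        severity = sum(run)              # DS: sum of the index values in the run
        duration = len(run)              # DD: run length in months
        intensity = severity / duration  # DI: mean index value in the run
        events.append((severity, duration, intensity))
    return events


# Hypothetical monthly index values, for illustration only
spi = [0.3, -1.2, -1.5, -0.4, 0.8, -1.1, -2.0, -1.3, 0.1]
events = drought_runs(spi)
for ds, dd, di in events:
    print(f"DS={ds:.2f}  DD={dd}  DI={di:.2f}")
```

With this sample series two runs are found, lasting two and three months respectively.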

Given the need for a single “magnitude” value to compare to the hypothesized one for the interviewee, two of these characteristics need to be combined into a single value (given that one of the three is just derived from the other two). The choice is made to rank events based on a multivariate ranking obtained from DD and DI, given that DD and DS are positively correlated, as the magnitude of DS increases monotonically with DD. Both DD and DI values are normalized via a simple linear normalization from 0 to 1 (corresponding respectively to the minimum and maximum DD and DI values in the dataset) and summed together to obtain a \(\chi_{I}\) value.
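
The normalization and sum can be illustrated as follows. This is a sketch under stated assumptions: the helper names `min_max` and `magnitude` are hypothetical, the sample events are invented, and DI is taken as a positive magnitude (absolute value of the mean index value) so that larger values mean more intense droughts, which the text does not state explicitly.

```python
def min_max(values):
    """Linearly rescale values so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:  # degenerate case: all values identical
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def magnitude(durations, intensities):
    """Sum of min-max normalized DD and DI, one value per event (range 0 to 2)."""
    dd_norm = min_max(durations)
    di_norm = min_max(intensities)
    return [d + i for d, i in zip(dd_norm, di_norm)]


# Hypothetical events: durations in months, intensities as positive magnitudes
durations = [2, 3, 6, 4]
intensities = [1.35, 1.47, 1.10, 1.90]
chi_i = magnitude(durations, intensities)
print([round(c, 4) for c in chi_i])
```

Because each component lies between 0 and 1, the resulting magnitude lies between 0 and 2, which is compatible with the range reported in the summary statistics.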

As an example, this ranking procedure is applied to an SPI series from the study area.

The obtained events display the following distribution of the \(\chi_{I}\) value:

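The obtained values closely follow a normal distribution: the estimated skewness (about 0.03) and kurtosis (about 3.05) reported below are near the values of 0 and 3 expected for a normal distribution.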

## summary statistics
## ------
## min:  0.1875   max:  1.63918 
## median:  0.8234945 
## mean:  0.8368748 
## estimated sd:  0.3097594 
## estimated skewness:  0.03001125 
## estimated kurtosis:  3.047254

The fitted distribution is used for the definition of both the \(\chi_{I}\) and \(\chi_{P}\) values.

Definition of a bivariate distribution of magnitude values

The magnitude values from the index series and from the interviewee are simulated as a bivariate normal distribution, where both distributions have the same parameters (given that they are obtained from normalized data).
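
One way to draw such correlated pairs, shown here as a sketch and not as the script's implementation, is to build the second variable from the first plus independent noise. The function names are hypothetical; the default mean and standard deviation are taken from the summary statistics above, and the correlation `rho` is a free parameter of the simulation.

```python
import math
import random


def simulate_pairs(n, rho, mean=0.8368748, sd=0.3097594, seed=None):
    """Draw n (chi_I, chi_P) pairs from a bivariate normal with correlation rho."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        chi_i = mean + sd * z1
        # Mixing z1 with independent z2 gives unit variance and correlation rho
        chi_p = mean + sd * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
        pairs.append((chi_i, chi_p))
    return pairs


def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


pairs = simulate_pairs(20000, rho=0.6, seed=1)
r = pearson([p[0] for p in pairs], [p[1] for p in pairs])
print(f"empirical correlation over the full sample: {r:.3f}")
```

On a large sample the empirical correlation should sit close to the imposed `rho`, which provides a quick sanity check before any events are excluded.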

Simulation of responses

Exclusion of interviewee responses

The interview process is simulated by first eliminating all events under a certain \(\chi_{P}\) threshold for the interviewee, simulating the inability to remember events with lower than average characteristics.

This also excludes a certain number of ranks which the interviewee will not remember.

Exclusion for questions

As said before, a number of events must be excluded in order to be able to conduct the interview. For now, 10 events are assumed to be asked about, chosen as the top 10 from the ranking of \(\chi_{I}\) values. This further restricts the number of events for which a response is available.
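
The two exclusion steps can be combined in a short sketch. The function name, the sample pairs and the threshold value are hypothetical, and the code assumes that the top events are selected from the full index ranking (the interviewer does not know \(\chi_{P}\) in advance), which is one possible reading of the procedure.

```python
def asked_and_remembered(pairs, perceived_threshold, k=10):
    """Return the (chi_I, chi_P) pairs that are both asked about and remembered."""
    # The interviewer asks only about the k events with the largest chi_I
    top_k = sorted(pairs, key=lambda p: p[0], reverse=True)[:k]
    # The interviewee recalls only events at or above the chi_P threshold
    return [p for p in top_k if p[1] >= perceived_threshold]


# Hypothetical (chi_I, chi_P) pairs, for illustration only
pairs = [(1.4, 0.9), (1.2, 0.5), (1.0, 1.1), (0.9, 0.7), (0.6, 1.3), (0.4, 0.2)]
usable = asked_and_remembered(pairs, perceived_threshold=0.8, k=3)
print(usable)
```

In this toy case only two of the six events remain, which illustrates how quickly the usable sample shrinks.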

As can already be seen from the examples, such a procedure is unlikely to yield a significant correlation value, especially if a mid-low correlation is assumed over the whole dataset.

Empirical correlation

The effects of the non-random choice procedure on data with different underlying correlations are shown by plotting the resulting distributions of empirical correlations and their (non-)significance.

The distributions clearly show that the procedure makes it very difficult to obtain a significant value from the test, and even then the value obtained is not close to the actual one.
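
A rough Monte Carlo sketch of this experiment is given below. It is not the script's code: the function names, the number of events per series (30), the use of the fitted mean as the memory threshold and the choice of Spearman rank correlation are assumptions, and the significance test is left out to keep the example within the Python standard library.

```python
import math
import random


def ranks(values):
    """Ranks starting at 1 (ties are not expected for continuous draws)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)


def one_interview(rng, rho, n_events=30, k=10, mean=0.8368748, sd=0.3097594):
    """Simulate one interview; return the correlation on usable events, or None."""
    pairs = []
    for _ in range(n_events):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        pairs.append((mean + sd * z1,
                      mean + sd * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)))
    top_k = sorted(pairs, key=lambda p: p[0], reverse=True)[:k]
    kept = [p for p in top_k if p[1] >= mean]  # remembered events only
    if len(kept) < 3:  # too few points for a meaningful correlation
        return None
    return spearman([p[0] for p in kept], [p[1] for p in kept])


rng = random.Random(0)
results = [one_interview(rng, rho=0.5) for _ in range(500)]
values = [v for v in results if v is not None]
print(f"mean empirical correlation: {sum(values) / len(values):.3f} (true value 0.5)")
```

Repeating the loop for several values of `rho` gives one distribution of empirical correlations per underlying correlation, which can then be compared with the true value.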

Imputation

Different imputation methods are proposed and tested. Each replaces the missing (non-asked) values in a different way, with the aim of better representing the entire dataset.

Median value substitution

The median value of the interviewer’s and interviewee’s rankings, respectively, is substituted for all missing values.
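
Median substitution can be sketched as follows, with missing ranks represented by `None`. The function name and the sample rankings are hypothetical and only meant to show the mechanics.

```python
import statistics


def impute_median(values):
    """Replace every missing entry (None) with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in values]


# Hypothetical rankings with missing (non-asked or not remembered) entries
index_ranks = [1, 2, 3, None, None, 4]
perceived_ranks = [2, 1, None, None, None, 3]

print(impute_median(index_ranks))
print(impute_median(perceived_ranks))
```

Since every missing entry receives the same value, this method introduces many ties, which is worth keeping in mind when a rank correlation is computed afterwards.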

Bayesian Approach

Next, imputation is performed via a Bayesian approach, reconstructing the missing values with Multivariate Imputation by Chained Equations (MICE). First, an example is given: the imputation is performed using a Bayesian linear regression, followed by a post-processing step that “fills in” the missing ranks from the interview. From these imputed values, a correlation and a p-value are calculated.

Next, the results from a number of iterations using this method are presented.